1. Architectural Analysis


This work compared computer architectures by measuring the execution of an identical set of high-level
language programs. Comparative studies are difficult and expensive, as they require an environment in which
all the architectures can be analyzed on a common basis. Simulation has been used, but its slow speed makes
it prohibitively time-consuming to collect a significant sample. Performance measures, such as the number of
instructions, reflect not only architectural differences but also factors (such as compilers) not related to
the architecture.

The instruction streams of the IBM S/370, DEC PDP-11, and P-code machines were measured using a
microprogrammable processor, Emmy. The measurement mechanism is embedded in the interpreter (an
emulator) for the machine and has access to all aspects of instruction execution. The DEC VAX
instruction stream was measured on a VAX 11/780 using a trace feature in the architecture. A set of
FORTRAN programs, reflecting a scientific workload, was used for the measurements.

The analysis first studied the composition of the instruction stream. The total number of instructions
executed shows the VAX architecture to be the most efficient, but measures of the activity required of the
interpreter indicate that the S/370 representation is the fastest to interpret. Memory reference behavior
indicated that the 8-bit displacement used by the VAX is very effective for local referencing, but that the
VAX suffers in referencing global objects. Measurements of branch behavior showed that the PDP-11 and VAX
architectures require 10-15% more branch instructions than an ideal representation of the program would
indicate. This architectural defect results from the short range of conditional branch instructions.
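The effect of a short branch range can be illustrated with a small sketch. It assumes, hypothetically, that a conditional branch carries a signed 8-bit displacement and that an out-of-range target is reached by an inverted conditional branch around an unconditional jump. The function name and the sample distances are illustrative only and are not measurements from this study.

```python
def branches_needed(distance, disp_bits=8):
    """Return the number of branch instructions needed to reach a target.

    Assumption: a conditional branch holds a signed displacement of
    `disp_bits` bits; a target outside that range costs one extra
    instruction (inverted conditional branch over an unconditional jump).
    """
    lo = -(1 << (disp_bits - 1))
    hi = (1 << (disp_bits - 1)) - 1
    return 1 if lo <= distance <= hi else 2


# Hypothetical branch distances, for illustration only.
distances = [12, -40, 300, 90, -500, 7, 127, 128, -128, 60]
ideal = len(distances)  # one branch per control transfer
actual = sum(branches_needed(d) for d in distances)
overhead = (actual - ideal) / ideal
print(f"ideal={ideal} actual={actual} overhead={overhead:.0%}")
```

The overhead printed here depends entirely on the made-up distances; only the mechanism (extra branches caused by out-of-range targets) is meant to mirror the text.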

This work also analyzed the interaction between compiler optimization techniques and the instruction streams
that result from optimization. Six S/370 compilers generated different representations of the test workload
and produced the database for the study of high-level language behavior and architectural analysis.
Optimization, while reducing the resource demands of a program, does not apply uniformly to all aspects of
instruction execution. The fixed-point computation and memory reference demands are greatly reduced, but the
control requirements of a program are largely unaffected. Because the absolute occurrence of control-related
instructions is constant, their relative frequency increases with optimization.
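The last point is simple arithmetic, sketched below with hypothetical instruction counts (they are not data from the six compilers): the control count is held fixed while optimization removes other instructions, so the control share of the stream rises.

```python
def relative_frequency(control, other):
    """Fraction of executed instructions that are control related."""
    return control / (control + other)


# Hypothetical counts: control instructions stay constant while
# optimization removes fixed-point and memory-reference instructions.
control = 200
unoptimized_other = 800
optimized_other = 400

before = relative_frequency(control, unoptimized_other)
after = relative_frequency(control, optimized_other)
print(f"before={before:.2f} after={after:.2f}")
```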

2. Directly Executed Languages


A natural way to make compilation as straightforward as possible is to make the execution architecture fit
the high-level language. This work centers on a family of execution architectures called directly executed
languages, or DELs. High-level language statements are closely represented by DEL instructions. DEL
representations minimize the number of bits needed in the instruction stream for operand specification,
without resorting to encodings that require knowledge of the frequency of occurrence of individual operands.
A Pascal-to-DEL compiler and a DEL processor emulator are used to measure the number of instructions
required to run five test programs to completion. The number is also measured for the HP 1000E, the IBM
370, Pascal P-code, and the DEC VAX. The average of the ratios of these numbers to the number for
Adept is 3.46. The number of main memory bytes read for data is 5.42 times that for Adept, and the ratio for
bytes written is 14.73. These results show that it is possible to fit an execution architecture to a
high-level language in a way that yields architectural measures suggesting a higher speed of
execution and a lower cost of implementation than some familiar architectures.


3. Concurrent Execution

The effective execution time of an instruction can be reduced by overlapping the start of the execution of
one instruction with the end of the execution of another. This process of overlapping instruction execution
is called pipelining, and it is used on all "supercomputers". An example of an instruction stream that has
been pipelined is shown in Figure 3-1.

| IF | DC | AG | OF | EX |
     | IF | DC | AG | OF | EX |
          | IF | DC | AG | OF | EX |
               | IF | DC | AG | OF | EX |

Figure 3-1: Pipelined Execution
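The overlap in Figure 3-1 can be sketched in a few lines. The sketch assumes that each of the five stages takes one machine cycle, that one instruction enters the pipeline per cycle, and that nothing stalls; the function names are illustrative.

```python
STAGES = ["IF", "DC", "AG", "OF", "EX"]  # stage names taken from Figure 3-1


def pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions when one starts per cycle and none stalls."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def render(n_instructions):
    """Return one text row per instruction, each shifted right by one stage width."""
    rows = []
    for i in range(n_instructions):
        rows.append("     " * i + "| " + " | ".join(STAGES) + " |")
    return rows


for row in render(4):
    print(row)
print(pipeline_cycles(4), "cycles pipelined versus", 4 * len(STAGES), "unpipelined")
```

Under these assumptions the cost per instruction approaches one cycle as the stream grows, which is the restriction that characterizes Level 0 later in this section.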

Extending the concept of pipelining to its ultimate would generate an execution sequence in which all
instructions execute at the same time. But since simultaneous execution of all instructions would mean that
all inputs are fetched at the same time and all results are produced at the same time, simultaneous
execution could be performed correctly only if no instruction in the task required the completion of any
other instruction to perform its function. Since this is possible only in the most trivial of cases, all
other programs execute correctly only when instructions wait for the other instructions that produce the
results they need. An example of such an ultimate pipeline is illustrated in Figure 3-2.
Executing multiple instructions at a time will be called concurrent execution of instructions, as opposed to
parallel execution, which usually refers to a single instruction stream, multiple data stream style of
computation.
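The waiting rule described above can be sketched as a dataflow schedule. This is a minimal illustration rather than the mechanism used in this research: it assumes every instruction takes one cycle and that hardware is unlimited, so only data dependences delay execution. The dependence list is hypothetical.

```python
def earliest_cycles(deps):
    """Earliest cycle in which each instruction can execute.

    `deps[i]` lists the earlier instructions whose results instruction i
    needs. Assumptions: every instruction takes one cycle and hardware
    resources are unlimited, so only data dependences delay execution.
    """
    cycle = []
    for i, producers in enumerate(deps):
        if any(p >= i for p in producers):
            raise ValueError("an instruction may depend only on earlier ones")
        # An instruction runs one cycle after its latest producer finishes.
        cycle.append(1 + max((cycle[p] for p in producers), default=0))
    return cycle


# Hypothetical six-instruction task: 0 and 1 are independent, 2 needs both,
# 3 is independent, 4 needs 2 and 3, and 5 needs 4.
deps = [[], [], [0, 1], [], [2, 3], [4]]
cycles = earliest_cycles(deps)
print(cycles, "total cycles:", max(cycles))
```

With no dependences at all, every entry is 1 and the whole task finishes in a single cycle, which is the trivial case mentioned in the text.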

There are different degrees to which concurrent execution of instruction streams can take place. Using a
minimal amount of hardware, a small amount of concurrency can be detected, providing a modest increase in
execution speed. The amount of concurrency detected in these schemes can be compared to the serial
speedup of traditional machines that have instruction prefetch but little or no pipelining. Introducing more
hardware increases the amount of concurrency that can be detected and, consequently, the execution speed
of the task. Concurrency such as this can be compared to a highly pipelined machine that allows
out-of-order execution, multiple path exploration, and interleaved memory traffic.

Figure 3-2: The Ultimate Pipeline


Although the degrees of concurrency detection can be thought of as a continuum, four distinct levels
have been defined in our research. The levels are defined so that a higher level number means more
concurrency detected, with a corresponding increase in the difficulty of detection and, consequently,
in the amount of hardware needed for implementation. They are:

3.1 Level 0 - Pipelined Execution

The main feature of Level 0 concurrency detection is that no concurrency is actually detected at all.
Level 0 concurrency is characterized by the fastest program execution possible under the single restriction
that, on average, at most one instruction is executed in each machine cycle.

3.2 Level 1 - Transparent Concurrency

Level 1 concurrency is characterized by the direct examination and exploitation of the concurrency that
exists in the original task. At this level, only the concurrency that is explicitly apparent
in the task is detected.

3.3 Level 2 - Machine Detectable Concurrency

Level 2 concurrency is marked by taking advantage of all the concurrency that a machine can detect without
resorting to algorithm recoding or code manipulation.


3.4 Level 3 - Algorithm Detectable Concurrency

Level 3 concurrency is characterized by analyzing the job to be done and restructuring it in hardware and
software to produce a representation that will execute in a minimal amount of time using a minimum
number of steps. More precisely, Level 3 concurrency is all concurrency that can be detected
algorithmically. This detection process can occur at both the hardware and software levels.

We have completed a fairly extensive study of the speedup potential at each of the above levels. One such
study found the concurrency available in a sample DEL program, shown in Figure 3-3. Note that level 3a is
compile-time detection only, while level 3b represents both compile-time and run-time detection.


Level of       Dynamic Number     Machine Cycles
Concurrency    of Instructions    to Execute Task    Speedup

0              435                435                1.00
1              435                306                1.42
2              435                180                2.42
3a             390                121                3.22
3b             390                93                 4.19

Figure 3-3: Ratios for Different Levels of Concurrency
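The speedup column of Figure 3-3 appears to equal the dynamic instruction count divided by the machine cycles at the same level, that is, the average number of instructions completed per cycle. The sketch below checks that reading against the tabulated values; the interpretation is an inference from the numbers, the first row is taken to be Level 0, and the function name is illustrative.

```python
# Rows of Figure 3-3: (level, dynamic instructions, machine cycles, reported speedup)
rows = [
    ("0", 435, 435, 1.00),
    ("1", 435, 306, 1.42),
    ("2", 435, 180, 2.42),
    ("3a", 390, 121, 3.22),
    ("3b", 390, 93, 4.19),
]


def instructions_per_cycle(instructions, cycles):
    """Average number of instructions completed in each machine cycle."""
    return instructions / cycles


for level, instructions, cycles, reported in rows:
    computed = instructions_per_cycle(instructions, cycles)
    print(f"level {level}: computed {computed:.2f}, reported {reported:.2f}")
```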